Storage and Retrieval of Highly Repetitive Sequence Collections

نویسندگان

  • Veli Mäkinen
  • Gonzalo Navarro
  • Jouni Sirén
  • Niko Välimäki
چکیده

A repetitive sequence collection is a set of sequences which are small variations of each other. A prominent example are genome sequences of individuals of the same or close species, where the differences can be expressed by short lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, the suffix tree occupies much space, which very soon inhibits in-memory analyses. Recent advances in full-text indexing reduce the space of the suffix tree to, essentially, that of the compressed sequences, while retaining its functionality with only a polylogarithmic slowdown. However, the underlying compression model considers only the predictability of the next sequence symbol given the k previous ones, where k is a small integer. This is unable to capture longer-term repetitiveness. For example, r identical copies of an incompressible sequence will be incompressible under this model. We develop new static and dynamic full-text indexes that are able of capturing the fact that a collection is highly repetitive, and require space basically proportional to the length of one typical sequence plus the total number of edit operations. The new indexes can be plugged into a recent dynamic fully-compressed suffix tree, achieving full functionality for sequence analysis, while retaining the reduced space and the polylogarithmic slowdown. Our experimental results confirm the practicality of our proposal.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections

A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N . Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. This paper is devoted to studying ways to store massive se...

متن کامل

Molecular Typing of Mycobacterium Tuberculosis Isolated from Iranian Patients Using Highly Abundant Polymorphic GC-Rich-Repetitive Sequence

Background: Tuberculosis (TB) with more than 10 million new cases per year and one of the top 10 causes of death worldwide, is still one of the most important global health problems. Also, multi drug-resistant tuberculosis (MDR) is a serious danger to public health. Understanding of the epidemiological pattern of mycobacterium tuberculosis (MTB), Estimates of recent transmission and recurrence ...

متن کامل

Document Listing on Repetitive Collections with Guaranteed Performance

We consider document listing on string collections, that is, finding in which strings a given pattern appears. In particular, we focus on repetitive collections: a collection of sizeN over alphabet [1, σ] is composed of D copies of a string of size n, and s single-character edits are applied on the copies. We introduce the first document listing index with size Õ(n+s), precisely O((n lg σ+s lg2...

متن کامل

Storage and Retrieval of Individual Genomes

A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N . Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on a such typically h...

متن کامل

Indexing Highly Repetitive Collections

The need to index and search huge highly repetitive sequence collections is rapidly arising in various fields, including computational biology, software repositories, versioned collections, and others. In this short survey we briefly describe the progress made along three research lines to address the problem: compressed suffix arrays, grammar compressed indexes, and Lempel-Ziv compressed indexes.

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • Journal of computational biology : a journal of computational molecular cell biology

دوره 17 3  شماره 

صفحات  -

تاریخ انتشار 2010